Survey of Backward Error Recovery Techniques for Multicomputers Based on Checkpointing and Rollback

Authors

  • G. Deconinck
  • J. Vounckx
  • R. Lauwereins
  • J. A. Peperstraete
Abstract

Backward error recovery, based on checkpointing and rollback, is often used to implement fault-tolerance in multicomputer systems. During failure-free operation, the process states are saved at regular intervals; after a fault is detected, the system is rolled back to a previously saved state. Four classes of techniques can be distinguished: semi-automatic techniques, message logging, coordinated checkpointing and hybrid techniques. This paper surveys these techniques and discusses the overhead each may incur. This should allow users to choose the most suitable checkpointing and rollback technique for the facilities and applications at hand.

The rest of the paper is organised as follows. The overhead (in time, storage and communication) and the other requirements and aspects used to compare the techniques are presented in section 2. The semi-automatic technique is discussed in section 3, while the classes of user-transparent checkpointing techniques are described in section 4. A comparison is made in section 5, after which some general conclusions are presented in section 6.

2. Aspects for Comparison

In this section, the aspects used to compare the different checkpointing and rollback techniques are considered. The overhead, capabilities and constraints of the algorithms must be examined.

Keywords: fault-tolerance, backward error recovery, checkpointing, rollback
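The checkpoint-and-rollback cycle described in the abstract can be illustrated with a minimal single-process sketch. This is not one of the surveyed techniques: the class `CheckpointedProcess`, its `interval` parameter and the `increment` helper are hypothetical names introduced only for illustration, and the sketch ignores everything that makes the multicomputer case hard (inter-process messages, consistent global states, stable storage).

```python
import copy


class CheckpointedProcess:
    """Hypothetical single-process illustration of checkpointing and rollback."""

    def __init__(self, state, interval):
        self.state = state
        self.interval = interval  # steps between checkpoints (assumed parameter)
        self.step_count = 0
        # The initial state serves as the first checkpoint.
        self.checkpoint = (0, copy.deepcopy(state))

    def step(self, update):
        """Apply one computation step; save the state every `interval` steps."""
        update(self.state)
        self.step_count += 1
        if self.step_count % self.interval == 0:
            self.checkpoint = (self.step_count, copy.deepcopy(self.state))

    def rollback(self):
        """Restore the most recently saved state after a fault is detected."""
        self.step_count, saved = self.checkpoint
        self.state = copy.deepcopy(saved)
        return self.step_count


def increment(state):
    # Stand-in for real application work.
    state["counter"] += 1


proc = CheckpointedProcess({"counter": 0}, interval=3)
for _ in range(5):
    proc.step(increment)

# A fault is assumed to be detected here; work done in steps 4 and 5 is lost.
resumed_at = proc.rollback()
```

In this sketch the checkpoint interval trades time and storage overhead during failure-free operation against the amount of work that has to be redone after a rollback.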


Similar resources

Coordinated Checkpointing-Rollback Error Recovery for Distributed Shared Memory Multicomputers

Most recovery schemes that have been proposed for Distributed Shared Memory (DSM) systems require unnecessarily high checkpointing frequency and checkpoint traffic, which are sensitive to the frequency of interprocess communication in the applications. For message-passing systems, low overhead error recovery based on coordinated checkpointing allows the frequency of checkpointing to be determin...


The FTMPS-Project: Design and Implementation of Fault-Tolerance Techniques for Massively Parallel Systems

The FTMPS-project provides a solution to the need for fault-tolerance in large systems. A complete fault-tolerance approach is developed and being implemented. The built-in hardware error-detection features combined with software error-detection techniques provide a high coverage of transient as well as permanent failures. Combined with the diagnosis software, the necessary information for t...


The FTMPS-Project: Design and Implementation of Fault-Tolerance Techniques for Massively Parallel Systems

The FTMPS-project provides a solution to the need for fault-tolerance in large systems. A complete fault-tolerance approach is developed and being implemented. The built-in hardware error-detection features combined with software error-detection techniques provide a high coverage of transient as well as permanent failures. Combined with the diagnosis software, the necessary information for the...


A backward/forward recovery approach for the preconditioned conjugate gradient method

Several recent papers have introduced a periodic verification mechanism to detect silent errors in iterative solvers. Chen [PPoPP’13, pp. 167–176] has shown how to combine such a verification mechanism (a stability test checking the orthogonality of two vectors and recomputing the residual) with checkpointing: the idea is to verify every d iterations, and to checkpoint every c × d iterations. W...


A Survey and Performance Analysis of Checkpointing and Recovery Schemes for Mobile Computing Systems

Ruchi Tuli (Yanbu University College, Royal Commission for Jubail and Yanbu, Directorate General for Yanbu, P.O. Box 30436, Madinat Yanbu Al Sinaiyah, Kingdom of Saudi Arabia) and Parveen Kumar (Merrut Institute of Engineering and Technology, Merrut, India) ...




Publication date: 1993